-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] Add streaming + parallel CSV reader, with decompression support. #1501
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #1501 +/- ##
==========================================
+ Coverage 74.74% 74.76% +0.01%
==========================================
Files 60 60
Lines 6118 6130 +12
==========================================
+ Hits 4573 4583 +10
- Misses 1545 1547 +2
|
c29cbbe
to
46ec9ef
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! And you were right haha, I think we should probably pull out the estimated rows commit until after we do the MicroPartition and scan operator work.
// default to Utf8 for conflicting datatypes (e.g bool and int) | ||
DataType::Utf8 | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any merging we have to do the the temporal types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming that we'll have a variety of follow-ups there, yes.
schema, | ||
// Default buffer size of 512 KiB. | ||
buffer_size.unwrap_or(512 * 1024), | ||
// Default chunk size of 64 KiB. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these constants make sense given its a local buffered file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a TODO to benchmark locally and tweak these, but I'm assuming that tweaking these won't matter as much for local reads as they do for cloud reads. I can look into tweaking these if you'd like!
This reverts commit 8acf496.
@samster25 I reverted the estimated row size piping from schema inference, added a good bit more test coverage, and addressed your primary review comments, PTAL! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! However we should also test everything through the python side as well via Dataframe tests if that isn't already done!
…t are likely to not require reallocation.
This PR adds streaming + parallel CSV reading and parsing, along with support for streaming decompression. In particular, this PR:
The streaming + parallel reading + parsing leads to a 4-8x speed up over the pyarrow reader and the previous non-parallel reader when benchmarking large file (~1 GB) reads, while also resulting in lower memory utilization due to the streaming reading + parsing.
TODOs (follow-up PRs)